IE 582 DATA MINING
Final Project Report
FEBRUARY 15, 2021
MINERS: MEHMET BAHADIR ERDEN FATMA NUR DUMLUPINAR UFUK ÜSTÜNDAĞ
2016402000 - 2016402150 - 2017402060
1. TABLE OF CONTENTS
0) COVER PAGE
1) TABLE OF CONTENTS
2) INTRODUCTION
2.1) Data Mining and its Applications
2.2) Descriptive Analysis of the Data
2.3) Brief Summary of the Proposed Approach
3) RELATED LITERATURE
3.1) Models/Algorithms
3.1.1) Decision Tree
3.1.2) Support Vector Machine
3.1.3) Random Forest
3.1.4) Stochastic Gradient Boosting
3.1.5) XGBoost
3.1.6) Model Training and Parameter Tuning with Caret Package in R
3.2) Class Imbalance Problem
3.2.1) Handling Class Imbalance Problem with Caret Package in R
3.3) Metrics
3.3.1) Accuracy
3.3.2) F1 Score
3.3.3) ROC AUC
3.3.4) PR AUC
3.3.5) BER
3.3.6) Customized Metric in trainControl function of Caret Package in R
4) APPROACH
4.1) Exploratory Analysis
4.1.1) Building Simple Models
4.1.2) Misclassification Analysis on Excel Sheet
4.2) Advanced Analysis
4.3) Performance Analysis
4.4) Prediction and Manipulation
5) RESULTS
5.1) Performance Scores Over Time
5.2) Best Models and Best Score
5.2.1) Random Forest
5.2.2) Stochastic Gradient Boosting
5.2.3) XGBoost
5.2.4) Best Score
6) CONCLUSION AND FUTURE WORK
6.1) Conclusion
6.2) Future Work
7) REFERENCES
8) CODE
2. INTRODUCTION
2.1) Data Mining and its Applications
Data mining is the process of extracting useful information from a large amount of raw
data. This concept has increased its popularity rapidly around the world over the last decades and
is widely used in many areas such as science and business. Some key applications of data mining
can be listed as behavioral analysis of the data based on trends, prediction of outcomes, forecasting
on future values, creation of decision-based structures, analysis of big data, and clustering [1]. The purpose of this project is to use sophisticated data mining methods in order to analyze the given data effectively and to make predictions for solving a binary classification problem for several instances.
2.2) Descriptive Analysis of the Data
The data provided for this task consists of two main parts: the train data and the test data. The train data includes 60 features for 2074 instances together with the binary target values for all of these instances, whereas the test data lacks the target values and contains 2073 instances for the same 60 features. The main task is to develop a structure that learns from the train data and use it to predict the target values for the test data. It is worth noting that there may be too many features relative to the number of instances, which can lead to models that learn the train data "too well". In other words, overfitted models should be avoided since they may perform extremely well on the train data yet poorly on the test data.
Some other characteristics of the given data should be noted before the analysis. Firstly, the features are unlabeled, which makes it impossible to connect the data to reality. This might be a disadvantage in terms of understanding its nature and foreseeing the usefulness of the methods to be used. Also, the data has no missing values to deal with, which eases the analysis. Another important point is that some features are continuous whereas others are binary variables, just like the target value. Class imbalance is present in the target values: the number of "a"s is about 3 times the number of "b"s. Therefore, some methods may tend to favor "a" values, which is another danger for healthy predictions.
Furthermore, to achieve a better understanding about the data, statistics of the features are
briefly analyzed by the skim function and correlation matrix, results of which can be found below.
In addition, further analysis on continuous variables was carried out by the use of pandas profiling
module. As a result, some features were found to be redundant and no significant abnormality nor
difference was detected between the distributions of the train and test data.
Table 1: Skim Function on Train Data
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 x1 0 1 30.1 4.70 13 27 30 33 50 ▃▇▂▁
2 x2 0 1 0.671 0.470 0 0 1 1 1 ▃▁▁▁▇
3 x3 0 1 0.662 0.473 0 0 1 1 1 ▅▁▁▁▇
4 x4 0 1 0.690 0.462 0 0 1 1 1 ▃▁▁▁▇
5 x5 0 1 9.08 5.54 0 4 9 14 18 ▇▇▆▇▇
6 x6 0 1 8.99 5.61 0 4 9 14 18 ▇▇▅▆▇
7 x7 0 1 9.11 5.50 0 4 9 14 18 ▇▇▆▇▇
8 x8 0 1 30.2 5.60 13 26 30 34 49 ▅▇▃▁
9 x9 0 1 101. 58.3 0.1 49.5 101. 153. 200 ▇▇▇▇▇
10 x10 0 1 99.7 57.7 0 49.2 99.6 150. 200. ▇▇▇▇▇
11 x11 0 1 99.7 56.9 0.1 52.4 97.5 148. 200. ▇▇▇▇▇
12 x12 0 1 0.343 0.475 0 0 0 1 1 ▇▁▁▁▅
13 x13 0 1 0.0333 0.179 0 0 0 0 1 ▇▁▁▁▁
14 x14 0 1 406. 118. 20 404 404 454 999 ▇▃▁▁
15 x15 0 1 0.850 0.358 0 1 1 1 1 ▂▁▁▁▇
16 x16 0 1 0.113 0.316 0 0 0 0 1 ▇▁▁▁▁
17 x17 0 1 0.239 0.426 0 0 0 0 1 ▇▁▁▁▂
18 x18 0 1 0.0203 0.141 0 0 0 0 1 ▇▁▁▁▁
19 x19 0 1 0.0444 0.206 0 0 0 0 1 ▇▁▁▁▁
20 x20 0 1 0.0458 0.209 0 0 0 0 1 ▇▁▁▁▁
21 x21 0 1 0.0284 0.166 0 0 0 0 1 ▇▁▁▁▁
22 x22 0 1 0.0473 0.212 0 0 0 0 1 ▇▁▁▁▁
23 x23 0 1 0.484 0.500 0 0 0 1 1 ▇▁▁▁▇
24 x24 0 1 0.104 0.305 0 0 0 0 1 ▇▁▁▁▁
25 x25 0 1 0.120 0.325 0 0 0 0 1 ▇▁▁▁▁
26 x26 0 1 0.00820 0.0902 0 0 0 0 1 ▇▁▁▁▁
27 x27 0 1 128. 70.1 14 79 120 159 570 ▇▆▁▁▁
28 x28 0 1 0.101 0.302 0 0 0 0 1 ▇▁▁▁▁
29 x29 0 1 0.0342 0.182 0 0 0 0 1 ▇▁▁▁▁
30 x30 0 1 636. 159. 62 562 624 812 999 ▇▁▃
31 x31 0 1 0.0338 0.181 0 0 0 0 1 ▇▁▁▁▁
32 x32 0 1 425. 147. 189 311 411 522 999 ▇▇▃▁▁
33 x33 0 1 0.0333 0.179 0 0 0 0 1 ▇▁▁▁▁
34 x34 0 1 0.0598 0.237 0 0 0 0 1 ▇▁▁▁▁
35 x35 0 1 0.0661 0.248 0 0 0 0 1 ▇▁▁▁▁
36 x36 0 1 20.0 93.4 0 0 0 0 845 ▇▁▁▁▁
37 x37 0 1 0.000482 0.0220 0 0 0 0 1 ▇▁▁▁▁
38 x38 0 1 0.142 0.349 0 0 0 0 1 ▇▁▁▁▁
39 x39 0 1 0.124 0.330 0 0 0 0 1 ▇▁▁▁▁
40 x40 0 1 0.124 0.330 0 0 0 0 1 ▇▁▁▁▁
41 x41 0 1 0.737 0.441 0 0 1 1 1 ▃▁▁▁▇
42 x42 0 1 10.5 73.3 0 0 0 0 999 ▇▁▁▁▁
43 x43 0 1 0.0270 0.162 0 0 0 0 1 ▇▁▁▁▁
44 x44 0 1 0.692 0.462 0 0 1 1 1 ▃▁▁▁▇
45 x45 0 1 0.0603 0.238 0 0 0 0 1 ▇▁▁▁▁
46 x46 0 1 0.00723 0.0848 0 0 0 0 1 ▇▁▁▁▁
47 x47 0 1 0.127 0.333 0 0 0 0 1 ▇▁▁▁▁
48 x48 0 1 0.144 0.351 0 0 0 0 1 ▇▁▁▁▂
49 x49 0 1 0.0101 0.100 0 0 0 0 1 ▇▁▁▁▁
50 x50 0 1 0 0 0 0 0 0 0 ▇▁▁
51 x51 0 1 0.105 0.307 0 0 0 0 1 ▇▁▁▁▁
52 x52 0 1 0 0 0 0 0 0 0 ▇▁▁
53 x53 0 1 0.0863 0.281 0 0 0 0 1 ▇▁▁▁▁
54 x54 0 1 0.320 0.467 0 0 0 1 1 ▇▁▁▁▃
55 x55 0 1 0.0174 0.131 0 0 0 0 1 ▇▁▁▁▁
56 x56 0 1 0.429 0.495 0 0 0 1 1 ▇▁▁▁▆
57 x57 0 1 0.000964 0.0310 0 0 0 0 1 ▇▁▁▁▁
58 x58 0 1 0.142 0.349 0 0 0 0 1 ▇▁▁▁▁
59 x59 0 1 0.00627 0.0789 0 0 0 0 1 ▇▁▁▁▁
60 x60 0 1 0.0313 0.174 0 0 0 0 1 ▇▁▁▁▁
61 y 0 1 0.245 0.430 0 0 0 0 1 ▇▁▁▁▂
Figure 1: Correlation Matrix of Train Data
Figure 2: Sample Analysis on x1 by Pandas Profiling
2.3) Brief Summary of the Proposed Approach
At the beginning of the analysis, the plan is to build simple models as exploratory moves in order to be able to comment on the nature of the data and the target variable. Some of these simple models are logistic regression, penalized regression, and decision trees. During this procedure, the train data is randomly divided into two parts: the model is built on 80% of the data and tested on the remaining 20%.
Afterwards, according to the performances of these simple models, more complicated
models which are expected to provide better results were chosen. These models include random
forest, stochastic gradient boosting, penalized decision tree, lasso regression, support vector
machine, neural network, and more. The performance measures of these algorithms are calculated
according to 10-fold cross validation and detailed parameter tuning with several repeats. The
models and their parameters with the best performances are determined and run on the test data to
make predictions. AUC values, balanced error rate, and specificity were used as the performance
measures while evaluating the algorithms.
Finally, although the task is by nature a binary classification problem, probability values between 0 and 1 were obtained and submitted, where the target value "a" is represented by 0 and the target value "b" by 1.
3. RELATED LITERATURE
3.1) Models/Algorithms
3.1.1) Decision Tree
A decision tree is a simple tree-like structure consisting of nodes and branches. At each node, the data is split based on one of the input features, generating two or more branches as output. This iterative process increases the number of generated branches and partitions the original data, and it continues until a node is generated where all or almost all of the data belong to the same class and further splits are no longer possible. Decision tree algorithms of this kind are often referred to as CART (Classification and Regression Trees).
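Although the project's own models were built in R with caret, the split search at the heart of a classification tree can be sketched in a few lines of Python. This is an illustrative sketch, not project code; `best_split`, `gini`, and the toy data are invented for the example.

```python
def gini(labels):
    # Gini impurity of a set of 0/1 labels
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of class 1
    return 1.0 - p ** 2 - (1 - p) ** 2

def best_split(x, y):
    # scan candidate thresholds on one feature; pick the one
    # minimizing the weighted Gini impurity of the two branches
    best_t, best_score = None, float("inf")
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# perfectly separable toy data: the best split lands between 2 and 3
t, s = best_split([1, 2, 3, 4], [0, 0, 1, 1])
```

A real tree repeats this search recursively on each branch until the stopping condition above is met.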
3.1.2) Support Vector Machines
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. Many possible hyperplanes could be chosen to separate the two classes of data points; the objective is to find the plane with the maximum margin, that is, the maximum distance to the data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
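As an illustration of the margin idea (not code from the project), the geometric margin of a given hyperplane w·x + b = 0 can be computed directly; a positive value means the plane separates the classes. The hyperplane and points below are hand-picked assumptions.

```python
import math

def margin(points, labels, w, b):
    # geometric margin of the hyperplane w·x + b = 0:
    # the smallest signed distance of any training point to the plane
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, yi in zip(points, labels)
    )

pts = [(0.0, 1.0), (1.0, 2.0), (3.0, 0.0), (4.0, 1.0)]
lab = [-1, -1, 1, 1]  # labels in {-1, +1}
m = margin(pts, lab, w=(1.0, -1.0), b=-1.0)
```

An SVM solver searches over (w, b) to make this quantity as large as possible.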
3.1.3) Random Forest
Random forest is another powerful and commonly used supervised learning algorithm. It is an ensemble method that combines the results of many classifiers, allowing quick identification of significant information from vast datasets. Its key advantage is that it aggregates the outputs of many decision trees, each trained on a bootstrap sample of the data, to arrive at a prediction.
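The project trained random forests through caret in R; the two mechanisms described above, bootstrap sampling and voting over many trees, can be sketched in Python (names and data here are illustrative only):

```python
import random
from collections import Counter

def bootstrap_sample(x, y, rng):
    # draw n instances with replacement, as each tree in a forest does
    idx = [rng.randrange(len(x)) for _ in range(len(x))]
    return [x[i] for i in idx], [y[i] for i in idx]

def forest_predict(tree_predictions):
    # final class = majority vote over the individual trees
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
bx, by = bootstrap_sample([10, 20, 30], [0, 0, 1], rng)
vote = forest_predict(["a", "b", "a", "a", "b"])  # three trees say "a"
```

In a full implementation each tree would also restrict its split search to a random subset of the features, which further decorrelates the trees.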
3.1.4) Stochastic Gradient Boosting
Gradient Boosting Machine (for Regression and Classification) is a forward learning
ensemble method. The guiding heuristic is that good predictive results can be obtained through
increasingly refined approximations. The method helps to reduce the chances of getting stuck in local minima, plateaus, and other irregular regions of the loss function, so that a near-global optimum may be found.
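As a hedged sketch of the idea (squared loss, depth-1 stumps, and the step-function data are all assumptions for illustration; the project itself used caret in R), each boosting round fits a small learner to the current residuals on a random subsample, which is the "stochastic" part:

```python
import random

def fit_stump(x, residual):
    # threshold split minimizing squared error; predicts the mean residual on each side
    best = None
    for t in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def stochastic_boost(x, y, rounds=30, lr=0.3, subsample=0.7):
    # each round: draw a random subsample, fit a stump to its residuals,
    # and add the (shrunken) stump to the running prediction
    rng = random.Random(0)
    pred = [0.0] * len(x)
    for _ in range(rounds):
        idx = rng.sample(range(len(x)), int(subsample * len(x)))
        stump = fit_stump([x[i] for i in idx], [y[i] - pred[i] for i in idx])
        pred = [p + lr * stump(xi) for p, xi in zip(pred, x)]
    return pred

x = list(range(10))
y = [0.0] * 5 + [1.0] * 5  # step function at x = 4.5
pred = stochastic_boost(x, y)
```

The learning rate below 1 and the subsampling are exactly the knobs that help the ensemble escape irregular regions of the loss surface.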
3.1.5) XGBoost
XGBoost stands for Extreme Gradient Boosting; it is a specific implementation of the gradient boosting method which uses more accurate approximations to find the best tree model. It computes:
- second-order gradients, i.e. second partial derivatives of the loss function, which provide more information about the direction of gradients and how to reach the minimum of the loss function;
- advanced regularization (L1 & L2), which improves model generalization [2].
Training with XGBoost is very fast and can be parallelized / distributed across clusters.
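The second-order idea can be made concrete for logistic loss: with first derivative g = p − y and second derivative h = p(1 − p) of the loss with respect to the raw score, XGBoost's optimal value for a leaf is −Σg / (Σh + λ), where λ is the L2 regularization strength. A small illustrative computation (not project code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_weight(y, raw_scores, lam=1.0):
    # first- and second-order gradients of logistic loss w.r.t. the raw score
    g = [sigmoid(s) - yi for yi, s in zip(y, raw_scores)]
    h = [sigmoid(s) * (1 - sigmoid(s)) for s in raw_scores]
    # Newton-style optimal leaf value, shrunk by the L2 penalty lam
    return -sum(g) / (sum(h) + lam)

# a leaf holding three positive instances, all starting from raw score 0 (p = 0.5)
w = leaf_weight([1, 1, 1], [0.0, 0.0, 0.0], lam=1.0)
```

Note how the λ term in the denominator shrinks the leaf value toward zero, which is one source of XGBoost's regularization.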
3.1.6) Model Training and Parameter Tuning with Caret Package in R
The caret package has several functions that attempt to streamline the model building and evaluation process. The train function can be used to:
- evaluate, using resampling, the effect of model tuning parameters on performance;
- choose the "optimal" model across these parameters;
- estimate model performance from a training set.
Once the model and tuning parameter values have been defined, the type of resampling should also be specified. Currently, k-fold cross-validation (once or repeated), leave-one-out cross-validation, and bootstrap resampling methods can be used by train. After resampling, the process produces a profile of performance measures that guides the user as to which tuning parameter values should be chosen. By default, the function automatically chooses the tuning parameters associated with the best value, although different selection rules can be used.
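To make the tuning loop concrete, the following Python sketch mimics what caret's train automates: evaluate each candidate tuning value with k-fold resampling and keep the best one. The one-parameter threshold "classifier" and the toy data are assumptions for illustration; a real run would fit a model on the training folds at each step.

```python
import statistics

def kfold_indices(n, k):
    # contiguous folds, as in (non-repeated) k-fold cross-validation
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_accuracy(x, y, threshold, k):
    # resampled accuracy estimate for one candidate tuning value
    scores = []
    for test_idx in kfold_indices(len(x), k):
        correct = sum(1 for i in test_idx if (x[i] > threshold) == bool(y[i]))
        scores.append(correct / len(test_idx))
    return statistics.mean(scores)

def tune(x, y, grid, k=5):
    # pick the tuning value with the best resampled performance
    return max(grid, key=lambda t: cv_accuracy(x, y, t, k))

x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 0, 0, 1, 1, 1, 1]
best = tune(x, y, grid=[0.05, 0.5, 0.95], k=4)
```

caret's train does the same grid-plus-resampling dance, optionally repeating the cross-validation several times to stabilize the estimates.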
3.2) Class Imbalance Problem
The class imbalance problem arises when the instances of the target classes are not equally distributed in a classification problem. It is called a problem because learning algorithms tend to show poor performance on such classes. Some common methods to reduce its impact are:
- class weights, which up-weigh cases of the minority class and down-weigh the majority class; this is called Inverse Probability Weighting (IPW);
- down-sampling, which randomly removes samples from the frequent class;
- up-sampling, which randomly replicates samples of the infrequent class;
- the synthetic minority oversampling technique (SMOTE), which down-samples the frequent class and generates new minority instances by interpolating between existing ones [3].
3.2.1) Handling Class Imbalance Problem with Caret Package in R
Both weighting and sampling methods are easy to employ in caret. Weights can be incorporated into the model via the weights argument of the train function, while the sampling methods can be applied via the sampling argument of the trainControl function.
3.3) Metrics
- A true positive (TP) occurs when the model correctly predicts the positive class.
- A true negative (TN) occurs when the model correctly predicts the negative class.
- A false positive (FP) occurs when the model incorrectly predicts the positive class.
- A false negative (FN) occurs when the model incorrectly predicts the negative class.
3.3.1) Accuracy
Accuracy measures how many observations, both positive and negative, were correctly classified. It should not be used on imbalanced problems because it is easy to get a high accuracy score by simply classifying all observations as the majority class.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
3.3.2) F1 Score
It combines precision and recall into one metric by calculating the harmonic mean of the two. It is a special case of the more general F-beta score:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)

When choosing beta in the F-beta score, the more you care about recall over precision, the higher the beta you should choose. For example, with the F1 score we care equally about recall and precision, while with the F2 score recall is twice as important to us.
3.3.3) ROC AUC
AUC stands for area under the curve, so the ROC curve must be defined before the ROC AUC score. The ROC curve is a chart that visualizes the tradeoff between the true positive rate (TPR) and the false positive rate (FPR): for every threshold, TPR and FPR are calculated and plotted on one chart. To get a single number that tells us how good the curve is, we can calculate the area under the ROC curve, or ROC AUC score. The more top-left the curve is, the higher the area and hence the higher the ROC AUC score.
3.3.4) PR AUC
To define PR AUC, we first need to define the precision-recall curve, as with ROC AUC. It is a curve that combines precision (PPV) and recall (TPR) in a single visualization: for every threshold, PPV and TPR are calculated and plotted. The higher the curve sits on the y-axis, the better the model performance.
3.3.5) BER
The Balanced Error Rate (BER) is derived from the Balanced Classification Rate (BCR):
BCR = ½ (TP / (TP + FN) + TN / (TN + FP))
BER = 1 - BCR
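The scalar metrics above can all be computed together from the four confusion-matrix counts. A small illustrative sketch (the counts are made up for the example):

```python
def metrics(tp, tn, fp, fn):
    # accuracy, F1 and BER from the confusion-matrix counts,
    # following the formulas in the sections above
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also TPR / sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)     # TNR
    bcr = 0.5 * (recall + specificity)
    return {"accuracy": accuracy, "f1": f1, "ber": 1 - bcr}

# imbalanced example: 50 positives, 150 negatives
m = metrics(tp=40, tn=120, fp=30, fn=10)
```

Note how accuracy (0.8) looks healthy here while F1 (about 0.67) exposes the weaker performance on the positive class, which is exactly the warning from the accuracy section.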
3.3.6) Customized Metric in trainControl function of Caret Package in R
The trainControl function has a parameter called summaryFunction [4]. A predefined performance measure function can be assigned to this parameter, so that the train function builds a model considering this customized metric. The metric is maximized if the maximize parameter of the train function is set to TRUE.
4. APPROACH
4.1) Exploratory Analysis
4.1.1) Building Simple Models
The first approach to the problem was to measure the performances of several basic models
in order to have an improved idea on the structure of the data in addition to the descriptive analysis.
At this point, the train data was split randomly into two parts, which included 80% and 20% of the
instances. The 80% part of the train data was used to train the models whereas tests were made on
the other 20%. The algorithms initially used were logistic regression, penalized regression, and
decision trees, however, more complicated ones such as random forest, stochastic gradient
boosting, and penalized decision tree were also tried in a similar manner. The performance
measures of these algorithms were compared to determine a road map for further analysis.
Figure 3: Random Separation of Train Data to 80% and 20% Parts
4.1.2) Misclassification Analysis on Excel Sheet
Another way to deal with class imbalance and obtain better performance measures was the identification of problematic features. To this end, the predictions on the 20% part of the train data were compared with the real values in an Excel sheet. The instances where misclassifications occurred were examined in terms of the algorithms used to build the models as well as the distribution of each feature at these instances. Afterwards, the feature mean values of the misclassified data and of the whole data were compared to detect large differences and thereby determine the features that cause misclassifications.
The figure below may serve as an example for such an analysis. x54, which is a binary
variable, has a mean of 0.23636 in the whole data whereas its mean decreases to 0.03509 in the
misclassified instances. This sharp decrease signals the high tendency of the model to make errors
when x54 is zero.
Figure 4: Determination of Problematic Features via Excel Sheet Analysis
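The same mean-comparison can be expressed in a few lines (an illustrative Python sketch with made-up x54 values, not the actual sheet):

```python
def mean_shift(feature_values, misclassified_idx):
    # compare a feature's mean over all instances vs. over the misclassified ones;
    # a large shift flags the feature as a likely source of errors
    overall = sum(feature_values) / len(feature_values)
    mis = [feature_values[i] for i in misclassified_idx]
    mis_mean = sum(mis) / len(mis)
    return overall, mis_mean, mis_mean - overall

# toy binary feature in the spirit of x54
x54 = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
overall, mis_mean, shift = mean_shift(x54, misclassified_idx=[1, 2, 4])
```

A strongly negative shift, as in the x54 example of Figure 4, means errors concentrate where the feature is zero.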
Additionally, the misclassified instances were compared in terms of the models used at
these instances. It was noticed that usually different models made errors in similar instances.
Different models were tried for the misclassified instances and combinations of different
algorithms were used to improve the performance measures.
4.2) Advanced Analysis
After the exploratory analysis and having an idea on the detailed structure of the data,
advanced analysis was made by 10-fold cross validation and detailed parameter tuning with repeats
on the selected algorithms that gave the best performance measures. Some of the algorithms tried
during exploratory analysis which were found beneficial, therefore run with repeated cross
validation and parameter tuning were random forest and stochastic gradient boosting. In addition
to these, other advanced algorithms, which included XGBoosting, support vector machines, and
neural networks were tried during the advanced analysis as well. To deal with class imbalance, up
and down sampling methods were applied, however, no improvement was detected. Furthermore,
some models were run after a dimensionality reduction via PCA to 9 principle components. The
best models in terms of performance measures were detected and they were run on the test data for
obtaining predictions. In addition, combinations of some different models were also applied to seek
a better model. Predictions were submitted and further improvement was chased for according to
the results.
Figure 5: Sample Neural Network Model
Due to a misunderstanding of the task, initially, binary values were submitted instead of
probabilities, which made it more difficult to get better scores and make advanced manipulations
on the prediction values. However, after this problem was solved, higher scores were obtained and
improvement was achieved via the manipulation of predictions.
4.3) Performance Analysis
While evaluating different model alternatives and selecting the best one among them,
performance measures were checked as an indicator. For this purpose, a custom metric called
“prime” was defined which took into account the AUC value, balanced error rate, and specificity.
The models to make predictions on the test data were chosen according to the performance obtained
by this metric.
Figure 6: Custom Metric
Figure 7: Performance Measure Functions
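The exact weighting used in the "prime" metric is given in Figure 6; the following is only an illustrative reconstruction of the idea in Python. The equal weights are an assumption, and the rank-based AUC is a standard textbook formula, not project code.

```python
def roc_auc(y_true, scores):
    # rank-based (Mann-Whitney) AUC: fraction of positive/negative
    # pairs the positive instance outscores, ties counting half
    pos = [s for yt, s in zip(y_true, scores) if yt == 1]
    neg = [s for yt, s in zip(y_true, scores) if yt == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def prime(auc, ber, specificity, w=(1 / 3, 1 / 3, 1 / 3)):
    # hypothetical combined score: higher is better, so BER enters as (1 - BER)
    return w[0] * auc + w[1] * (1 - ber) + w[2] * specificity

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
score = prime(auc, ber=0.2, specificity=0.8)
```

Folding BER and specificity into the selection metric penalizes models that earn their AUC purely on the majority class, which matches the report's motivation.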
4.4) Prediction and Manipulation
After the evaluation on the algorithms and selection based on the performance measures,
the resulting models were run on the test data in order to get prediction values. At this point, average
prediction outcomes of different models were also used instead of using the prediction values
obtained only by a single model.
Finally, a manipulation of the resulting prediction values was applied before the submission step. The main purpose of this manipulation was to deal with the class imbalance problem by giving more weight to the prediction of "b"s (therefore 1 values as targets), since the models may favor the majority class, which is harmful for healthy predictions. Moreover, an adjustment based on threshold analysis was applied to the predictions to increase performance before submission. This threshold analysis was made carefully on an Excel sheet, an example of which can be found below.
Table 2: Sample Threshold Analysis
This sample threshold analysis indicates that better BER values could be obtained by setting
the final submission value to 1, when one model predicts above 0.35 or two models predict above
0.22. Similar threshold analyses were carried out through the submission period after careful
inspection.
Figure 8: Sample Threshold Manipulation #1
Manipulation         Accuracy for 1   BER
No Manipulation      0.696078431      0.795324
Threshold: 35-25-20  0.862745098      0.817954
Threshold: 35-25     0.823529412      0.806333
Threshold: 35-22     0.882352941      0.824563
Figure 9: Sample Threshold Manipulation #2
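The rule from the sample analysis, i.e. submit 1 when one model predicts above 0.35 or at least two models predict above 0.22, can be sketched as follows (the fallback to the plain model average is an assumption for illustration):

```python
def manipulate(preds, high=0.35, pair=0.22):
    # force the submission to 1 when any single model predicts above `high`,
    # or when at least two models predict above `pair`
    if max(preds) > high or sum(p > pair for p in preds) >= 2:
        return 1.0
    return sum(preds) / len(preds)  # otherwise keep the model average

a = manipulate([0.10, 0.40, 0.15])  # one model above 0.35
b = manipulate([0.25, 0.24, 0.05])  # two models above 0.22
c = manipulate([0.10, 0.20, 0.12])  # neither rule fires
```

In `a` and `b` the rule fires and the submission becomes 1.0, while `c` falls back to the average of the three predictions.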
5. RESULTS
5.1) Performance Scores Over Time
As stated before, due to a misunderstanding of the task, binary values were submitted at the beginning of the submission period as the final predictions, since the problem was by definition a binary classification problem. During this period, the highest score obtained was about 0.806, which increased up to 0.862 when probability values, i.e. continuous values between 0 and 1, were submitted instead of binary values. This change increased the performance scores drastically and made it possible to find improved submission values without trying new models, only by some manipulation of the predictions. As a result, a final score of 0.895 was obtained by the end of the submission period.
5.2) Best Models and Best Score
After a long and detailed analysis period, as explained before, it was noticed that the highest performance measures were obtained by the random forest, stochastic gradient boosting, and XGBoost algorithms. In addition, 10-fold cross-validation with detailed, repeated parameter tuning provided the most beneficial results and reliable performance measures. The best models for these algorithms, obtained in this manner, are presented below.
5.2.1) Random Forest
Figure 10: Best Random Forest Model
5.2.2) Stochastic Gradient Boosting
Figure 11: Best Stochastic Gradient Boosting Model
5.2.3) XGBoost
Figure 12: Best XGBoost Model
5.2.4) Best Score
Throughout the submission period, the best score attained was 0.8950, which was achieved
via the combination of the prediction values obtained by running the three models listed above on
the test data. The averages of these prediction sets were calculated and the resulting prediction set
was finalized after some manipulation in a similar manner as explained before.
6. CONCLUSION AND FUTURE WORK
6.1) Conclusion
In conclusion, data mining algorithms were used throughout this project for the purpose of solving a binary classification problem. After the descriptive analysis and exploratory trials with the help of simple models, many different sophisticated models were trained on the train data and evaluated based on the specified performance measures. The most successful algorithms and parameters were selected after several repetitions of 10-fold cross-validation and parameter tuning. Afterwards, the selected models were run on the test data to obtain prediction values in probability format. Finally, these values were manipulated to deal with class imbalance and to obtain better performance scores.
As a result, the random forest, stochastic gradient boosting, and XGBoost methods performed well in terms of the performance measures. A combination of the predictions obtained by these algorithms gave the best final score of 0.8950 after some adjustments and manipulations of the outcomes.
6.2) Future Work
There exist many additional simple and complex approaches for further improvement. First of all, a similarity analysis might be carried out to see whether there are instances that are similar in terms of their feature values; if such instances are detected, some target values of the test data could be determined trivially. Additionally, improvements in scores might be sought by analyzing past submissions and their scores, since they contain information about the target values of the test data. Moreover, a more advanced use of neural networks may yield far better performance measures; due to their complex structure, neural networks are described as black-box models and are difficult to analyze, but such a sophisticated analysis could lead to higher scores. To increase the number of instances, methods such as data augmentation can also be tried for improved results. Finally, some complex algorithms that require high computation time and power could be run, as well as the collection of additional instances or features.
7. REFERENCES
1) https://economictimes.indiatimes.com/definition/data-mining
2) http://theprofessionalspoint.blogspot.com/2019/02/difference-between-gbm-gradient.html
3) https://dpmartin42.github.io/posts/r/imbalanced-classes-part-1
4) https://stackoverflow.com/questions/49984506/caretxgtree-there-were-missing-values-in-resampled-performance-measures
Additional Resources:
5) https://datascience.stackexchange.com/questions/21954/which-is-better-out-of-bag-oob-or-cross-validation-cv-error-estimates
6) https://rtemis.lambdamd.org/imbalanced.html
7) https://towardsdatascience.com/explicit-auc-maximization-70beef6db14e
8) https://en.wikipedia.org/wiki/Receiver_operating_characteristic
9) https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc#1
10) https://cran.r-project.org/web/packages/caret/caret.pdf
11) https://www.kaggle.com/pelkoja/visual-xgboost-tuning-with-caret
12) https://stats.stackexchange.com/questions/326110/how-xgboost-uses-weight-in-the-algorithm
13) https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
14) https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html
8. CODE
General Github Folder Link: https://github.com/mbahadir/582project_files
Analysis Part
Notebook:
https://github.com/mbahadir/582project_files/blob/main/Project%20Python%20Analysis.ipynb
Python Script:
https://github.com/mbahadir/582project_files/blob/main/Project%20Python%20Analysis.py
Initial Models
Notebook: https://github.com/mbahadir/582project_files/blob/main/Initial%20Models.ipynb
R Script: https://github.com/mbahadir/582project_files/blob/main/Initial%20Models.r
Codes for Imbalance Problem Solution
Notebook for models subsetted with feature selection:
https://github.com/mbahadir/582project_files/blob/main/Models%20subsetted%20with%20feature%20selection.ipynb
R Script for models subsetted with feature selection:
https://github.com/mbahadir/582project_files/blob/main/Models%20subsetted%20with%20feature%20selection.r
Up/Down Sampling: https://github.com/mbahadir/582project_files/blob/main/up-down-sampling.R
Models with PCA method
R Script: https://github.com/mbahadir/582project_files/blob/main/IE582_final_project_PCA.R
Simple Neural Network
Notebook:
https://github.com/mbahadir/582project_files/blob/main/Simple%20Neural%20Netwok.ipynb
Python Script:
https://github.com/mbahadir/582project_files/blob/main/Simple%20Neural%20Netwok.py
Created Numeric Model and Support Vector Machine
Notebook:
https://github.com/mbahadir/582project_files/blob/main/Numeric%20Model%20Creation.ipynb
R Script:
https://github.com/mbahadir/582project_files/blob/main/Numeric%20Model%20Creation.r
Selected Models after first control:
Notebook: https://github.com/mbahadir/582project_files/blob/main/Selected%20Models.ipynb
R Script: https://github.com/mbahadir/582project_files/blob/main/Selected%20Models.r
The Last Models Analysis:
Notebook: https://github.com/mbahadir/582project_files/blob/main/The%20Model-Analysis.ipynb
R Script: https://github.com/mbahadir/582project_files/blob/main/The%20Model-Analysis.r
The Last Models Submission:
Notebook: https://github.com/mbahadir/582project_files/blob/main/The%20Model.ipynb
R Script: https://github.com/mbahadir/582project_files/blob/main/The%20Model.r
Excel Files
Feature Selection for Subsetting:
https://github.com/mbahadir/582project_files/blob/main/Feature%20Dataset%20Control.xlsx
Threshold Selection:
https://github.com/mbahadir/582project_files/blob/main/Threshold%20Selection.xlsx